Abstract: The MapReduce platform has recently been widely used for large-scale data processing and analysis. It works well if the hardware of a cluster is well configured. However, our survey indicates that common hardware configurations in small- and medium-sized enterprises may not be suitable for such tasks. The situation is more challenging for memory-constrained systems, in which memory is a bottleneck resource relative to CPU power and therefore does not meet the needs of large-scale data processing. According to our survey, the traditional high performance computing (HPC) system is one example of a memory-constrained system. We propose a new MapReduce system, parallel Mammoth, which aims to improve MapReduce performance through efficient memory management. The system uses a parallel multi-buffer technique to balance data production by the CPU against data consumption by disk I/O, thereby implementing non-blocking I/O. It also caches the final merged files output by Map tasks in memory, to avoid re-reading them from disk before transferring them to remote Reduce tasks. We have developed a multi-threaded execution engine, which is based on Hadoop but runs in a multi-JVM (MJVM) on each node. All Map/Reduce tasks on a physical node run inside the execution engine, and therefore in the MJVM, which is one of the key architectural differences between this system and native Hadoop. In the execution engine, we have implemented a hyper-scheduling algorithm for job assignment, together with techniques such as sequential disk access, multi-cache, and shuffling from memory, and we have solved the problem of full garbage collection in the MJVM. We have conducted extensive experiments to compare parallel Mammoth, with its scheduling algorithm, against the native Hadoop platform. The results show that the modified Mammoth system can reduce job execution time by more than 80 percent in typical cases, without requiring any modification to existing Hadoop programs.
Given the growing importance of large-scale data processing and analysis and the proven success of the MapReduce platform, the parallel Mammoth system has promising potential for practical impact.
Keywords: MapReduce, HPC, CPU, Mammoth, Hadoop, HDFS, MJVM.